Cell-lineage-specific transcripts are essential for differentiated tissue function, are implicated in hereditary organ failure, and 
mediate acquired chronic diseases. However, experimental identification of cell-lineage-specific genes in a genome-scale 
manner is infeasible for most solid human tissues. We developed the first genome-scale method to identify genes with 
cell-lineage-specific expression, even in lineages not separable by experimental microdissection. Our machine-learning-based 
approach leverages high-throughput data from tissue homogenates in a novel iterative statistical framework. We applied 
this method to chronic kidney disease and identified transcripts specific to podocytes, key cells in the glomerular filter 
responsible for hereditary and most acquired glomerular kidney disease. In a systematic evaluation of our predictions by 
immunohistochemistry, our in silico approach was significantly more accurate (65% accuracy in human) than predictions 
based on direct measurement of in vivo fluorescence-tagged murine podocytes (23%). Our method identified genes 
implicated as causal in hereditary glomerular disease and involved in molecular pathways of acquired and chronic renal 
diseases. Furthermore, based on expression analysis of human kidney disease biopsies, we demonstrated that expression of 
the podocyte genes identified by our approach is significantly related to the degree of renal impairment in patients. Our 
approach is broadly applicable to define lineage specificity in both cell physiology and human disease contexts. We 
provide a user-friendly website that enables researchers to apply this method to any cell-lineage or tissue of interest. 
Identified cell-lineage-specific transcripts are expected to play essential tissue-specific roles in organogenesis and disease 
and can provide starting points for the development of organ-specific diagnostics and therapies. 

[Supplemental material is available for this article.] 



Cell-lineage differentiation plays a defining role in biology. Im- 
pairment of differentiated cell functions is responsible for the 
organ-specific manifestation of acquired chronic degenerating 
diseases, including Alzheimer's disease, diabetes, and chronic 
kidney disease (CKD). Defining lineage-specific cellular function 
in human physiology and disease remains challenging, as it is 
frequently impossible to physically isolate a specific cell lineage 
from the heterogeneous lineages that make up many solid human 
tissues. This inability to obtain a pure cell preparation from human 
tissue in vivo and to identify the functional context of cell lineages 
on a genome-scale is a significant barrier to developing an un- 
derstanding of molecular interactions in complex tissues and 
diseases. 

Here we develop a computational approach (Fig. 1) that identifies 
genes specifically expressed in a cell lineage from high-throughput 
expression data of complex solid tissue biopsies. This 
problem is of significant biological and clinical relevance; however, 
obtaining a pure ex vivo cell population of sufficient size from the 
lineage of interest for direct assay is often technically infeasible, 
particularly when the lineage of interest is a component of solid 
tissues. This challenge imposes severe limitations on researchers' 
ability to account for the cell-lineage-specific expression and 
function of most human genes. This problem is distinct from the 
task of identifying the fractional composition of a heterogeneous 
sample (e.g., whole blood), and methods to address such problems 
require whole-genome expression measurements for each un- 
derlying cell type, which are unavailable for most solid human cell 
lineages (Shen-Orr et al. 2010). Our iterative machine-learning- 
based approach leverages heterogeneous expression data from 
human tissue homogenates. We term this approach "in silico 



Figure 1. Schematic overview of the in silico nanodissection workflow, an iterative approach for cell-lineage-specific gene prediction, validation, and 
functional analysis. Expert-curated literature annotations are iteratively combined with gene-expression data to predict genes specific to a cell lineage. 
These predictions are assessed, and the standards are refined. Validation of podocyte specificity of our predictions used publicly available resources 
followed by evaluation of intrarenal mRNA and protein expression analysis in correlation with clinical phenotypes to define regulation of predicted gene 
sets in human disease. 



nanodissection" because it is a computational approach that 
can analyze lineages that are not separable by experimental 
microdissection. 

Chronic diseases like diabetes and hypertension cause mor- 
bidity and mortality via alteration of differentiated organ function 
in a wide range of tissues. Here we use kidney disease as a proof of 
concept application, focusing on the podocytes, the highly dif- 
ferentiated glomerular epithelial cells responsible for most hered- 
itary and acquired glomerular disease (Gerstein 2001; Kim et al. 
2003; Roselli et al. 2004; Groop et al. 2009; D'Agati et al. 2011; 
Niewold 2011). As with most other differentiated cell lineages in 
solid tissues, discovering human podocyte-specific genes on a 
whole-genome scale has remained infeasible due to the challenge 
of obtaining pure ex vivo populations of sufficient size for high- 
throughput evaluation, making this important cell lineage an ideal 
proof of concept application for nanodissection. In this study, a gene 
is defined as "podocyte specific" within the renal context if its 
expression is limited to podocytes within the kidney, and as 
"podocyte specific within the renal glomerulus" if it is expressed 
only in podocytes within the glomeruli but is detectable in other 
extraglomerular cell lineages in 
the kidney. Previous high-throughput strategies have relied on 
mouse (Endlich et al. 2002; Brunskill et al. 2011; Jain et al. 2011) or 
human (Saleem et al. 2008) immortalized glomerular visceral 
epithelial cells, but in vitro culture leads to rapid loss of both lineage- 
specific phenotypes and lineage-specific gene expression. Whole- 
tissue-based molecular profiles of renal disease are attainable 
(Henger et al. 2004; Higgins et al. 2004; Schmid et al. 2006; 
Bennett et al. 2007; Ju et al. 2009; Hodgin et al. 2010; Lindenmeyer 
et al. 2010; Woroniecka et al. 2011) since human renal tissue is 
routinely obtained by diagnostic fine needle biopsy, but whole- 
tissue expression profiles have not previously been capable of 
identifying gene expression at the cell-lineage level. This difficulty 
is not unique to renal disease. Similar challenges exist for other 
clinically important lineages, e.g., neuronal cell-lineage-specific 
markers in neurodegenerative diseases like Alzheimer's disease or 
multiple sclerosis. Employing a computational approach to identify 
cell-lineage-specific molecules for noninvasive monitoring of 
neuronal functional status would help to address one of the key 
challenges pursued in the study of such diseases (Reddy et al. 2011). 

When applied to renal gene expression data sets, our nanodissection 
method predicts 136 genes not previously known to be 
podocyte specific. Through systematic immunohistochemistry- 
based evaluation, we show that our iterative in silico method sig- 
nificantly outperforms experimental strategies using fluorescence- 
activated cell sorting (FACS) separated GFP-tagged murine cells 
for identification of cell-lineage-enriched transcripts. We further 
demonstrate that expression of the nanodissection predicted 
podocyte-specific genes significantly correlates with kidney func- 
tion, as measured by the glomerular filtration rate (GFR), in pa- 
tients with CKD. The nanodissection method also predicts the 
most recently identified gene responsible for hereditary nephrotic 
syndrome (Mele et al. 2011). These findings reinforce the concept 
that defining cell-lineage-specific genes can provide important 
insights into the pathogenesis of and targeted therapies for de- 
generative human disease of the kidney, central nervous system, 
and other highly differentiated tissues. Our approach is freely 
available through a user-friendly website that allows researchers to 
easily explore any cell lineage or tissue of interest 
(http://nano.princeton.edu) and through an open-source C++ library 
(http://libsleipnir.bitbucket.org). 

Results 

In silico nanodissection approach discovers cell-lineage-specific 
genes 

To discover cell-lineage-specific genes, we developed in silico 
nanodissection, an iterative computational approach that pre- 
dicts cell-lineage-specific expression of human genes using high- 
throughput genomic expression data derived from tissue homog- 
enates. This method uses an iterative machine learning framework 
that makes robust predictions, even when only limited prior 
knowledge about cell-lineage-specific markers is available (for de- 
tailed description of the approach, see Methods). Intuitively our 
approach discovers patterns of coexpression of the cell-lineage 
markers in whole-tissue homogenates from a variety of genetic 



backgrounds, physiological and pathophysiological states. The 
approach leverages human curated markers of the cell lineage of 
interest (podocyte in this case) (Supplemental Tables 1, 2) to 
identify the genetic or pathophysiological perturbations in which 
the expression patterns of these markers are predictive of their cell- 
lineage specificity. These patterns of informative conditions are 
identified from comprehensive transcriptional data sets derived 
from tissue homogenates, often represent only a small fraction of 
all data, and are likely reflective of the markers' biological func- 
tions. The condition-specific patterns are then used to identify 
additional cell-lineage-specific genes. Our approach uses an iter- 
ative algorithm to refine the weighting of informative perturba- 
tions in a manner robust to the limited availability of curated 
markers (gold standards). Each gold standard provides differential 
specificity (e.g., those based on double immunofluorescence, im- 
munohistochemical (IHC) staining, or RNA abundance) and qual- 
ity. With our strategy, standards are assessed without the need for 
genome-scale measurements from a pure sample of the cell lineage 
of interest. Our method is robust to variable standard quality, and 
the machine learning component of our approach by itself is not 
sufficient for this robustness (Supplemental Fig. 5). 

We applied this nanodissection strategy to a data set of 452 
microarray measurements for microdissected human kidney bi- 
opsies and predicted 136 genes with novel podocyte-specific ex- 
pression in the renal context (Supplemental Table 3). These repre- 
sented all non-gold-standard genes among the top 150 predictions. 
The selected genes are the set with the maximum F-measure (Sup- 
plemental Methods) as assessed by cross-validation where precision 
was weighted five times as much as recall and resulted in a number 
of genes practical for systematic verification and validation. In silico 
nanodissection separated known podocyte genes from genes spe- 
cific to the other glomerular cell lineages and tubular cells (Fig. 2), 
while a simple correlation-based approach failed to do so (Supple- 
mental Fig. 1; Supplemental Methods). 
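
The cutoff selection described above can be illustrated with a small sketch. It assumes that "precision weighted five times as much as recall" corresponds to the standard F-beta measure with beta = 0.2; the exact definition is given in the Supplemental Methods and may differ. The function names and the toy gene labels are hypothetical and do not come from the study.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.

    beta < 1 favours precision; beta > 1 favours recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def best_cutoff(ranked_genes, positives, beta=0.2):
    """Return (k, score) for the top-k cutoff that maximises F-beta.

    ranked_genes: genes ordered from best to worst prediction score.
    positives: held-out known markers (assumed evaluation standard).
    """
    positives = set(positives)
    best_k, best_score, hits = 0, 0.0, 0
    for k, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            hits += 1
        precision = hits / k
        recall = hits / len(positives)
        score = f_beta(precision, recall, beta)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Toy ranking with hypothetical gene labels (not data from the study).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
known = {"g1", "g2", "g4", "g7"}
k, score = best_cutoff(ranked, known)
print(k, round(score, 3))
```

With a precision-heavy weighting, the selected cutoff tends to stop before the first run of non-marker genes, which is consistent with choosing a gene list short enough for systematic validation.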

The applicability of our nanodissection strategy is not limited 
to the podocyte: It can accurately separate genes from tissues 
as diverse as skin (skin fibroblast genes from melanocyte 




Figure 2. In silico nanodissection. Distribution of cell-type-specific prediction by percentile, estimated using a Gaussian kernel. Genes are ordered on 
the x-axis from worst (zero percentile) to best (100th percentile). The dotted line shows in silico nanodissection cutoff for the top 136 genes. 
Nanodissection successfully separates (area under the curve [AUC] 0.83) podocyte-specific genes (green) from genes specific to other renal cell lineages 
(glomerular endothelial in dark blue, glomerular mesangial in light blue, and tubular in red). 
and keratinocyte genes) and neuronal tissue (astrocyte genes from 
other glial cell-specific genes) from publicly available expression data 
for the corresponding tissue homogenates (Supplemental Figs. 2, 3). 
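
The separation summarized by the AUC in Figure 2 can be read as the probability that a randomly chosen podocyte-specific gene receives a higher prediction score than a randomly chosen gene specific to another lineage. The following is a minimal sketch of that pairwise definition; the scores are made-up values for illustration, and the study's own evaluation procedure may differ in detail.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive gene scores higher; ties contribute one half.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical prediction scores (not data from the study).
podocyte_scores = [0.9, 0.8, 0.4]
other_lineage_scores = [0.7, 0.3, 0.2]
print(auc(podocyte_scores, other_lineage_scores))
```

An AUC of 0.5 corresponds to no separation and 1.0 to perfect separation, so the reported 0.83 indicates that most podocyte genes outrank most genes of the other renal lineages.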



Confirmation of the specificity of in silico podocyte predictions 

In addition to the systematic evaluation by IHC staining (below), we found nanodissection-identified genes that were previously reported to have podocyte-specific expression patterns, such as PLA2R1 (Beck et al. 2009) and GJA1 (Supplemental Tables 3, 4; Yaoita et al. 2002; Sawai et al. 2006), but that were withheld during the expert curation. In addition to recapitulating past literature, this strategy predicted concurrent discoveries. While this manuscript was being prepared, two genes predicted by nanodissection, myosin IE (MYO1E) and PDZ and LIM domain 2 (PDLIM2), were shown to display podocyte-specific expression and play a role in renal function and hereditary and acquired glomerular disease (Mele et al. 2011; Sistani et al. 2011). MYO1E mutations were shown to cause childhood-onset, glucocorticoid-resistant focal segmental glomerulosclerosis (FSGS) (Ingelfinger 2011; Mele et al. 2011), and PDLIM2 exhibited reduced expression in patients with minimal change disease (MCD) and membranous nephropathy (MN) (Sistani et al. 2011). 

We used high-throughput IHC stainings from the Human Protein Atlas (HPA) (http://www.proteinatlas.org) to systematically validate the podocyte-specific expression of genes identified by nanodissection. Although genes with staining data available in the HPA were not annotated to the level of cell-lineage localization, intraglomerular cell types were identified based on their localization pattern inside of the glomerular tuft by three investigators with expertise in renal histopathology independently in a blinded manner (see Methods). This enabled us to systematically evaluate our predictions by IHC and to compare the performance of in silico nanodissection with experimental predictions from in vivo fluorescence-tagged murine podocytes (Fig. 3). 

As the first step of the validation strategy, a blinded evaluation of the localization of predicted podocyte proteins in IHC staining images from HPA was performed for predicted podocyte-specific transcripts and an equivalently sized set of randomly selected genes (included as control). Of the predicted 136 podocyte-specific proteins in HPA, 31 were found to have sufficient expression pattern for evaluation at the time of our study (using HPA version 7.0-2010.11.15). Of these 31 proteins, 20 (65%) were found to 




Figure 3. Evaluation of podocyte-specific genes based on qualified Human Protein Atlas (HPA) 
staining images. (A) HPA images demonstrate podocyte-specific pattern of positive standard markers 
and predicted genes. Staining pattern of positive standard markers: podocyte-specific in kidney: (I) 
nephrin (NPHS1); and podocyte-specific in glomerulus: (II) SYNPO and (III) CD2AP. Exemplary staining 
patterns for de novo nanodissection predicted proteins (IV-IX): (IV) FGF1; (V) ARHGAP28; (VI) 
PRKAR2B; (VII) PCOLCE2; (VIII) GJA1; and (IX) ZDHHC6. (B) HPA-based distribution of intrarenal protein 
staining pattern in random gene set, nanodissection-identified gene set, and the murine experimental 
approach-derived gene set: The in silico nanodissection approach (65%) significantly outperforms 
a random set of genes (12%) and the ex vivo murine experimental approach (23%) for identifying 
podocyte-specific genes. Gray bars show the proportion of genes with exclusively podocyte-specific 
staining within the kidney, and black bars show the proportion of genes with exclusively podocyte-specific 
staining within the glomerulus. 
be podocyte-specific within the renal context, with staining for seven 
(23%) exclusively attributed to the podocyte within kidney tissue and 
an additional 13 proteins (42%) stained exclusively to the podocyte 
within the glomerulus. The other 11 proteins (35%) were stained to 
other renal cells (Fig. 3B; Supplemental Table 4). The nanodissection-identified 
gene set significantly (Fisher's exact p = 3.256 × 10⁻⁶) 
outperformed the control (background) genes, of which one (2%) 
showed podocyte-specific staining within the kidney and five 
(10%) showed podocyte-specific staining within the glomerulus 
(Fig. 3B). Forty-two (88%) background genes were stained to other 
renal cells. 
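
The comparison above can be framed as a 2 x 2 contingency table: 20 podocyte-specific and 11 nonspecific proteins among the evaluable nanodissection predictions, versus 6 podocyte-specific and 42 nonspecific among the control genes. The sketch below implements a two-sided Fisher's exact test from the hypergeometric distribution using only the Python standard library. It is an illustration of the test, not the authors' analysis code, and small numerical differences from the reported p-value are possible depending on how the two-sided tail is defined.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are at most as likely as the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * tolerance
    )


# Counts taken from the text: nanodissection 20/31 vs. control 6/48.
p_value = fisher_exact_two_sided(20, 11, 6, 42)
print(p_value)
```

The same function applies to any other pair of gene sets evaluated in HPA, provided the counts of podocyte-specific and nonspecific proteins are tabulated in the same way.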

In contrast to humans, where podocyte-specific microdissec- 
tion is technically infeasible, murine transgenic model systems have 
been developed by the Genitourinary Development Molecular 
Anatomy Project (GUDMAP) consortium specifically to define cell- 
lineage-specific genes (McMahon et al. 2008). Lineage tracing was 
established by GFP expression using cell-type-specific promoters, 
followed by FACS and genome-wide expression profiling of GFP- 
positive single-cell suspension (Brunskill et al. 2011). Using the 
podocyte, mesangial, and endothelial cell-lineage-specific GUDMAP 
expression data sets (Supplemental Table 5), we subtracted en- 
dothelial and mesangial gene expression profiles from the tran- 
scriptome obtained from the podocyte preparation and identi- 
fied 102 podocyte-specific transcripts (Methods; Supplemental 
Table 6; McMahon et al. 2008). These transcripts underwent the 
same cell-lineage-specific evaluation in HPA as the in silico 
nanodissected human transcripts. Thirty of the 102 murine ex- 
perimental approach-derived podocyte-specific transcripts had 
staining patterns identifiable in HPA, of which staining for two 
was exclusively attributed to the podocyte within human kidney 
tissue and five proteins were stained to the podocyte within the 
glomerulus (Fig. 3B). Thus the in silico nanodissection approach 
exhibited a significantly higher accuracy (Fisher's exact p = 0.0059) 
than the murine experimental strategy for discovering transcripts 
with podocyte-specific expression (nanodissection's 65% vs. 
murine experimental approach's 23% of predictions confirmed as 
podocyte-specific). The in silico prediction accuracy of cell-lineage 
enrichment using human tissue homogenate exceeded that ob- 
tained from in vivo fluorescence-tagged and sorted cells in a murine 
model system. 

Disease-specific regulation of the nanodissection gene sets 

To test the hypothesis that the discovered podocyte-specific genes 
were associated with human renal disease, the transcript with the 
highest podocyte-specific score and positive HPA validation, pro- 
collagen C-endopeptidase enhancer 2 (PCOLCE2), was selected for 
further characterization. PCOLCE2 protein modulates binding of 
procollagen C-proteases to collagen in a BMP1-dependent and 
cell-lineage-restricted manner (Steiglitz et al. 2002), a process with 
significant relevance for the development and function of the 
glomerular basement membrane (Tanaka et al. 2010). We investigated 
the disease-specific transcriptional regulation of PCOLCE2. The 
steady-state mRNA level of PCOLCE2 in glomeruli from human 
renal biopsies was significantly repressed in patients with FSGS 
(n = 17), a glomerular disease with podocyte damage and end- 
stage renal disease (ESRD), compared with controls (n = 39, p < 
0.05) (Fig. 4A). In contrast, in glomeruli from patients with MCD 
(n = 12), a proteinuric disease without progression to ESRD, 
PCOLCE2 transcript levels were not significantly altered. In a co- 
hort of CKD patients with heterogeneous glomerular patho- 
physiology (n = 139), loss of PCOLCE2 glomerular gene expression 



was significantly correlated with loss of renal function (r = 0.32, 
p = 1.17 × 10⁻⁴). 

Disease-specific PCOLCE2 regulation was further validated in 
human kidneys affected by glomerular disease using IHC staining in 
an independent biopsy cohort. In concordance with the IHC staining 
patterns reported in HPA (Fig. 3A, VII), the podocyte-specific local- 
ization of PCOLCE2 protein was confirmed. In contrast to the nuclear 
and perinuclear PCOLCE2 signal seen in IHC in glomeruli of five 
healthy kidneys (Fig. 4B, I), PCOLCE2 staining was not detectable in 
glomeruli from eight FSGS patients (Fig. 4B, II), demonstrating the 
ability of the nanodissection strategy to detect genes with both cell- 
lineage-specific expression and disease-specific alteration in glomer- 
ular failure. 

Glomerular disease stratification by the de novo predicted 
podocyte-specific transcripts 

Podocyte damage leads to progressive loss of kidney function and the 
need for dialysis and renal transplantation. To test the association 
with glomerular function, the glomerular regulation of the podocyte 
gene set discovered by nanodissection was compared between 39 
controls and 17 patients with FSGS, a renal disease caused by severe 
podocyte damage (Kriz et al. 1994; Pavenstadt 2000) and the leading 
cause of glomerular failure in children. Using significance analysis of 
microarrays (SAM) (Tusher et al. 2001), 60 of the 136 genes identified 
by nanodissection were significantly repressed in glomeruli from 
FSGS patients versus controls (q-value < 0.05). 

Next, the regulation of predicted transcripts in chronic renal 
disease was evaluated in a cohort of 139 patients with glomerular 
diseases, including FSGS, diabetic nephropathy (DN), IgA nephrop- 
athy (IgAN), MN, lupus nephritis (LN), and MCD (Supplemental 
Table 7). Steady-state mRNA expression measurements of 
nanodissection-predicted podocyte-specific transcripts were correlated 
with GFR at the time of biopsy, currently the best overall index of 
kidney function used to classify the stages of CKD patients by the 
Kidney Disease Outcomes Quality Initiative (KDOQI). Expression of 
the set of 136 de novo predicted podocyte-specific genes was sig- 
nificantly (p < 0.01) more correlated with GFR than observed in 
permuted gene-GFR associations (Fig. 4C). These results demon- 
strate the potential for predicted podocyte-specific genes to be 
candidate markers for disease progression. This finding has sig- 
nificant clinical utility, as the cell-lineage-specific and disease- 
associated genes can provide superior specificity for biomarker 
testing in heterogeneous biofluids like urine or blood compared 
with ubiquitously expressed disease markers. The GFR correlation of 
the podocyte-specific gene set predicted by nanodissection supports 
the tight link of podocyte differentiation and function with renal 
impairment irrespective of initiation of renal disease by genetic or 
environmental causes. 
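
The permutation comparison behind this result can be sketched as follows. The gene-GFR association is computed for each gene in the set, and the same computation is repeated after shuffling the GFR values across patients (100 times, as in the Figure 4C caption). Using the mean Pearson correlation as the summary statistic is an assumption made here for illustration; the study compares the full distributions. The cohort below is synthetic and all names are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_correlation(expression, gfr):
    """Average gene-GFR correlation over all genes in the set."""
    return sum(pearson(profile, gfr) for profile in expression) / len(expression)


def permutation_p_value(expression, gfr, n_perm=100, seed=0):
    """Empirical one-sided p-value against shuffled GFR assignments."""
    rng = random.Random(seed)
    observed = mean_correlation(expression, gfr)
    shuffled = list(gfr)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the patient-to-GFR pairing
        if mean_correlation(expression, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)


# Synthetic cohort: 40 patients, 20 genes whose expression tracks GFR.
data_rng = random.Random(1)
gfr = [data_rng.uniform(15, 120) for _ in range(40)]
expression = [[0.02 * g + data_rng.gauss(0, 0.3) for g in gfr] for _ in range(20)]
observed, p_value = permutation_p_value(expression, gfr)
print(round(observed, 2), p_value)
```

With 100 permutations, the smallest attainable empirical p-value under this correction is 1/101, which is compatible with reporting a result as p < 0.01.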

Application of nanodissection to nonpodocyte lineages 

Mesangial cells are one of the three major cell types in kidney 
glomeruli, and mesangial expansion is a hallmark of DN. We in- 
vestigated the expression profile of the top 52 mesangial cell- 
specific genes predicted by nanodissection (cutoff based on the 
same F-measure criterion) (Supplemental Fig. 6) in an independent 
DN data set (data include glomerular gene expression profile of 13 
healthy donors and nine patients with DN) (Woroniecka et al. 
2011). In this data set, 50 of the 52 predicted mesangial cell- 
specific genes showed robust expression in the microdissected 
glomeruli. Forty-four percent of these genes (22 out of 50) 
exhibited increased steady-state mRNA levels in patients with DN 
compared with controls (SAM analysis q < 0.01); no transcript 
showed significantly reduced mRNA levels. This demonstrates 
both nanodissection's applicability to other renal cell lineages 
and its ability to identify lineage-specific genes with increased 
mRNA levels. 

We further evaluated nanodissection on nonrenal cell line- 
ages. For this analysis, we used tissue annotations from the Human 
Protein Reference Database (HPRD) (Keshava Prasad et al. 2009) as 
gold standards for tissue-specific expression. For example, we have 
applied nanodissection to identify genes specifically expressed in 
skin fibroblasts (density estimate cross-validation-based evalua- 
tion) (Supplemental Fig. 3). We evaluated genes above the maxi- 
mum F-measure criterion for significant enrichment in disease 
annotations from Online Mendelian Inheritance in Man (OMIM). 
The genes above the maximum F-measure criterion showed signif- 
icant enrichment of genes involved in two collagen disorders 
with known fibroblast expression: Ehlers-Danlos syndrome and 
Osteogenesis imperfecta. Six of the 10 genes associated with 
Ehlers-Danlos syndrome were above this threshold (false discovery 
rate [FDR]-corrected; q = 0.00038), as were six of the eight genes 
associated with Osteogenesis imperfecta (FDR-corrected; q = 
0.00011). None of these genes were included in the fibroblast gold 
standard from HPRD used by nanodissection; all were new 
predictions of fibroblast-specific genes. This demonstrates both the 
potential of nanodissection to identify cell-lineage-specific genes 
and the potential for those genes to be associated with 
cell-lineage-specific diseases. 
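
An enrichment analysis of this kind can be sketched as a hypergeometric test per disease annotation followed by a multiple-testing correction. The text does not state which test or FDR procedure was used, so the hypergeometric upper tail and the Benjamini-Hochberg adjustment below are assumptions. The universe size, the number of genes above the threshold, and the two extra p-values are hypothetical placeholders; only the counts 6 of 10 and 6 of 8 come from the text.

```python
from math import comb


def hypergeom_upper_tail(k, n_selected, n_annotated, n_universe):
    """P(X >= k): chance of at least k annotated genes among n_selected
    genes drawn from a universe containing n_annotated annotated genes."""
    denom = comb(n_universe, n_selected)
    top = min(n_selected, n_annotated)
    total = sum(
        comb(n_annotated, x) * comb(n_universe - n_annotated, n_selected - x)
        for x in range(k, top + 1)
    )
    return total / denom


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest p-value
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical sizes: 20,000 genes in total, 200 above the threshold.
UNIVERSE, SELECTED = 20000, 200
p_eds = hypergeom_upper_tail(6, SELECTED, 10, UNIVERSE)  # 6 of 10 genes
p_oi = hypergeom_upper_tail(6, SELECTED, 8, UNIVERSE)    # 6 of 8 genes
# Two further made-up p-values stand in for other tested annotations.
q_values = benjamini_hochberg([p_eds, p_oi, 0.2, 0.6])
print(q_values[:2])
```

Because the placeholder sizes are invented, the resulting q-values are not expected to match those reported in the text; the sketch only shows the shape of the calculation.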

Discussion 

Organ-specific transcriptional programs define the final stages in 
tissue development and the mature function of metazoan organ- 
isms. Alterations in the functions of genes with cell-lineage- 
restricted expression patterns are widely believed to lead to tissue- 
specific disease manifestations. Furthermore, inherited diseases are 
frequently caused by mutations in genes with restricted expression 
patterns (Winter et al. 2004; D'Agati 2008; Cai and Petrov 2010). 
Mutations in such genes often do not cause early embryonic le- 
thality but rather manifest disease at the time when the function of 
these genes becomes critical for a specific tissue and subsequently 
for organismal survival (D'Agati 2008). In acquired disease like di- 
abetes or hypertension, the vulnerability of a specific organ to the 
systemic disease is defined by the expression of tissue-specific genes 
(Doublier et al. 2003; Koop et al. 2003; Woroniecka et al. 2011). 
Defining cell-lineage-specific transcripts therefore has immediate 
clinical implications for such cell lineages. However, a major 
challenge in defining a specific cellular transcriptome has been the 
inability to obtain pure cell preparation from human tissue in vivo 
(e.g., as recently summarized by Lindenmeyer et al. [2010] for renal 
cell lineages). 

To identify cell-lineage-specific transcripts on a genome-wide 
scale even when direct experimental assays are infeasible, as is 
the case for most solid human tissues, we developed in silico 
nanodissection. This iterative machine-learning-based approach 
robustly leverages existing knowledge about the cell lineage of interest 
to identify transcripts with similar behavior in heterogeneous tran- 
scriptional data sets of tissue homogenates. In silico nanodissection 
does not require expression data of pure genome-wide profiles 
from the cell lineage of interest and is robust to small numbers and 
varying specificity of available cell-lineage markers. Although our 
strategy uses support vector machines (SVM) as the machine learning 
component of the nanodissection method (Fig. 1), in principle any 
machine learning approach that leverages positive and negative 
examples for training can be integrated in place of the SVM. 

This study represents the first high-throughput approach for 
identification of cell-lineage-specific genes for any cell lineage 
from in vivo human data. The approach is general: we found that 
our predictions remained robust (significantly overrepresented by 
podocyte-specific genes) even when the directly targeted expres- 
sion data (renal glomeruli) constitutes only 5% of the total data 
sets (the rest being diverse human expression data from the Gene 
Expression Omnibus [GEO]). In silico nanodissection applied to 
public gene expression data from tissue homogenates 
(implementation available through our nano.princeton.edu website) is 
also capable of accurately separating cell-lineage-specific genes in 
skin (skin fibroblast genes from melanocyte and keratinocyte 
genes) and neuronal tissue (astrocyte genes from other glial cell- 
specific genes). Furthermore, this approach is effective even with 
a limited number of positive marker genes or in situations when 
some provided lineage-specific genes are inaccurate/irrelevant to 
the cell lineage (Supplemental Figs. 4, 5). This makes in silico nano- 
dissection a promising approach to identify cell-lineage-specific 
genes that might be potentially associated with other acquired or 
inherited diseases, for which targeted data may not be available. 

Cell-type-specific markers predicted by nanodissection could 
be used to extend the applicability of methods that define the 
fractional composition of a mixture. These methods are then ca- 
pable of deconvoluting an expression signal to perform tests of 
gene expression within individual lineages. While some approaches 
require knowledge of the mixture's fractional composition, which 
limits their application to cell types measurable in cell sorter ex- 
periments (Shen-Orr et al. 2010), others can start with either pure ex 
vivo samples of each mixture component to estimate the sample 
composition or a high quality set of known markers (Kuhn et al. 
2011). These requirements currently limit the applicability of these 
methods, as they require information not available in studies of 
most complex solid human tissues. Markers identified by nano- 
dissection provide a promising starting point for the application of 
such deconvolution approaches to many more human cell lineages, 
including those not amenable to experimental microdissections. 

Nanodissection is the first genome-scale method that iden- 
tifies cell-lineage-specific genes important for renal disease in 
humans. In addition to the de novo identified podocyte-specific 
genes, genes that have been shown by genetic or functional studies 
to be causally involved in hereditary glomerular diseases and 
chronic progressive renal failure, including DACH1 (nanodissection 
rank 184) (Kottgen et al. 2010), APOL1 (rank 185) (Genovese et al. 
2010; Kopp et al. 2011), VEGFA (rank 247) (Eremina et al. 2008; 
Kottgen et al. 2010), and MYH9 (rank 260) (Kao et al. 2008; Kopp 
et al. 2008), were ranked highly by our method. During the 
preparation of this manuscript, MYO1E (rank 75) was reported to be 
associated with autosomal-recessive, glucocorticoid-resistant 
nephrotic syndrome (Mele et al. 2011). MYO1E was found to exhibit, 
as predicted by our study, podocyte-specific expression and 



Figure 4. Regulation of predicted podocyte-specific gene set in human 
disease. (A) Box-and-whisker plot of glomerular mRNA expression of 
PCOLCE2 in biopsies from living donor controls (LD, n = 35), minimal 
change disease (MCD) patients (n = 12), and focal segmental 
glomerulosclerosis (FSGS) patients (n = 19). Asterisk denotes a significant 
differential expression (p < 0.05). (B) IHC staining of PCOLCE2 on kidney 
biopsies from controls (I) and FSGS patients (II). In comparison with 
control kidneys, PCOLCE2 signal disappears in FSGS patients. Images 
shown are the representative images in the glomerulus of controls (n = 5) 
and FSGS patients (n = 8). (C) Density plot of the association (Pearson 
correlation, x-axis) of the 136 predicted podocyte-specific genes (red) 
with renal function as quantified by GFR value, compared with density 
plot of repeatedly (100 times) randomized gene expression-GFR 
associations (black). The randomized set shows a distribution centered on 
zero (meaning no correlation with GFR), whereas the podocyte-specific 
genes show a skewed distribution toward positive correlation, indicating 
reduced gene expression is associated with impaired renal function. 
Correlation with GFR of the 136 transcripts across all renal diseases 
analyzed was significantly enriched compared with the permuted sample 
(p < 0.01). Black line indicates the correlation of PCOLCE2 mRNA level 
with GFR. 



appears to interact with other cell-lineage-specific genes in 
podocyte cytoskeletal dynamics. PDLIM2 (rank 121), another 
gene identified by nanodissection, was recently reported to show 
podocyte-specific expression and repression in acquired glomerular 
disease (Mele et al. 2011; Sistani et al. 2011). Interestingly, neither of 
these genes is present in the list of 102 genes (Supplemental Table 6) 
identified as podocyte-specific using the murine ex vivo cell-lineage 
separation in the GUDMAP data sets. Brunskill et al. (2011) recently 
generated a transcriptional data set (144 genes) regulated during 
murine podocyte development and enriched in adult podocytes in 
comparison with the renal cortex. Analysis in HPA of the human 
orthologs exhibited a similar enrichment to the GUDMAP ex vivo 
cell-lineage data set used in Figure 3 (two podocyte-specific in kid- 
ney [5%] and 13 podocyte-specific in glomeruli [30%]), but the 
murine data did not reach the specificity of the in silico nano- 
dissection approach (65%). 

Cell-lineage-specific strategies capable of identifying genes 
associated with a disease provide additional value compared with 
unbiased genome-wide approaches that identify genes with ex- 
pression correlation to a specific phenotype (like GFR). The latter 
captures a different pool of transcripts: Abundantly expressed non- 
cell-lineage-specific genes constitute the majority of the tran- 
scripts correlated with renal function, but these transcripts do not 
necessarily perform cell-lineage-specific functions and may not be 
associated with hereditary disease. For example, expression levels 
of MYO1E and APOL1 are not strongly correlated to GFR (ranked 
1236 and 7430, respectively, by GFR-expression correlation), yet 
these two genes were identified by nanodissection to be cell-type- 
specific and have been shown experimentally (independent of 
and parallel to our work) to cause hereditary glomerular disease 
(Genovese et al. 2010; Kopp et al. 2011; Mele et al. 2011). 
Furthermore, a systematic analysis focusing on literature-curated 
hereditary FSGS-associated genes that do not overlap with our 
podocyte gold standard (see Supplemental Methods) also demon- 
strates that a genome-wide assessment of GFR-transcript expres- 
sion correlation alone could not identify genes associated with 
this hereditary renal disease (Supplemental Fig. 8A). In contrast, 
nanodissection can identify genes associated with disease, with 
FSGS genes receiving significantly higher podocyte nanodissection 
scores than those without known FSGS association (Supplemental 
Fig. 8B). Thus, nanodissection's ability to identify cell-lineage 
specificity is important for identifying genes potentially associ- 
ated with such diseases and clinical phenotypes. Beyond simply 
addressing issues of statistical power, methods that consider cell- 
lineage specificity provide additional utility because they address 
targeted biological questions that are tightly coupled to the dis- 
ease etiology. 

Our findings have significant potential for clinical utility. In 
the study of hereditary diseases, next-generation exome sequencing 
technologies are now widely applied and are capable of identifying 
putative causal genetic variants in very small pedigrees. However, 
these studies often result in multiple candidate genes in need of 
further prioritization. As hereditary diseases are often caused by 
cell-lineage-specific transcripts (see above and Hinkes et al. 2006), 
the systematic scoring system for cell-lineage-specific enrichment 
provided by the nanodissection approach can become a crucial tool to 
prioritize candidate genes for further validation. Conversely, the 
several hundred tissue-specific genes identified by nanodissection 
can be screened comprehensively in families with a hereditary disease 
of the organ of interest using targeted exon sequencing strategies, 
as is currently being pursued by our group in a rare disease cohort 
(Halbritter et al. 2012; Gadegbeku et al. 2013). 

For acquired chronic disease, the search is still ongoing to 
define specific and robust biomarkers of differentiated organ 
function. Unbiased molecular screening approaches have been 
largely disappointing in this context. Our data strongly support 
the close association of cell-lineage-specific transcripts and loss of 
end organ function in complex, chronic human kidney disease. 
Proteins encoded by these genes may be detected in plasma and 
urine and provide a noninvasive means to measure organ function 
in a cell-lineage-specific manner. In contrast to ubiquitously 
expressed molecules involved in fibrosis and inflammation, which 
are currently the most common source of candidate biomarkers for 
chronic diseases, cell-lineage-specific biomarkers are less likely to 
be confounded by extrarenal processes and should provide supe- 
rior diagnostic specificity (Fukuda et al. 2012). This has been 
demonstrated in the context of podocyte failure in model systems 
and human disease (Sato et al. 2009), and nanodissection provides 
an opportunity to expand the scope of podocyte-specific tran- 
scripts analyzed in a complex mixture of urinary cells by these 
approaches. Finally, functional studies of the cell-lineage-specific 
genes identified in human disease tissue offer the opportunity to 
develop a targeted therapeutic approach for chronic disease. Tar- 
geting a disease-specific molecular mechanism selectively in the 
tissue manifesting the disease has the potential to significantly 
increase efficacy while reducing off-target effects. 

In summary, nanodissection is a novel computational ap- 
proach for defining the specificity of cell types at the transcrip- 
tional level. As demonstrated for glomerular disease, but applicable 
across all organs with large-scale transcriptional data sets available, 
nanodissection can reveal novel transcripts with essential tissue- 
specific function in organogenesis and hereditary human disease. 
In chronic progressive diseases, the nanodissection-identified tran- 
scripts can serve as highly specific markers of disease stages and 
provide a starting point for the development of organ-specific tar- 
geted therapies. While we have shown that HPRD annotations can 
guide successful nanodissection analyses, we believe the method 
is most powerful when combined with high-quality, user-constructed 
standards, which can easily be supplied through our 
nanodissection web server. Nanodissection can be performed on user- 
curated, tissue-specific gene expression compendia via the user- 
friendly nanodissection web server at http://nano.princeton.edu. 
This web server includes 452 microarrays from microdissected kid- 
ney biopsy samples from this study, as well as 7539 samples across 
28 diverse human tissue collections manually curated from the Gene 
Expression Omnibus (GEO). Nanodissection of investigator-specific, 
gene-expression data sets can be performed with the Sleipnir 
library for functional genomics (version 3 or higher) available 
for Windows, OS X, and Linux systems from 
http://libsleipnir.bitbucket.org/ (Huttenhower et al. 2008). 

Methods 

Patient characteristics 

Human renal biopsy specimens were procured through an in- 
ternational multicenter study, the European Renal cDNA Bank- 
Kroener-Fresenius biopsy bank. Biopsies were obtained from pa- 
tients after informed consent and with approval of the local ethics 
committees. All biopsies were stratified by the reference patholo- 
gist of the ERCB according to their histological diagnoses. Histol- 
ogy reports, clinical data, and gene expression information were 
stored in a de-identified manner. A total of 452 microarrays from 
kidney biopsies were used for nanodissection; data from 139 
patients were used for the kidney function correlation analysis. 
Demographic data of these 139 patients are provided in 
Supplemental Table 7. 

Microdissected human kidney biopsy data 

Microdissection into glomerular and tubule-interstitial compart- 
ments and Affymetrix-based gene expression profiling were per- 
formed according to the method previously reported (Ju et al. 
2009). Affymetrix GeneChip Human Genome U133A 2.0 and 
U133 Plus 2.0 Arrays were used in this study. For this analysis, we 
restricted ourselves to the probesets present on both platforms. 
Normalized data files have been uploaded to the GEO (Edgar et al. 
2002) website and are accessible under reference numbers GSE32591 
(Berthier et al. 2012), GSE35488 (Reich et al. 2010), GSE37455 
(Berthier et al. 2012), and GSE37460 (Berthier et al. 2012). For 
simplicity, we use "in vivo" to refer to these assays measuring 
gene expression in human biopsies of complex tissues. 

In silico nanodissection for the prediction 
of cell-lineage-specific gene expression 

Our approach uses machine learning within a novel iterative 
framework to predict genes with cell-lineage-specific expression 
on the whole-genome scale based on gene expression data from 
tissue homogenates. This problem is especially challenging be- 
cause, in order to work for cell lineages that are infeasible to 
microdissect experimentally such as the podocytes, our approach 
must function without example expression profiles of the lineage 
of interest. 

Intuitively, our method leverages patterns of expression of 
cell-lineage-specific genes that it discovers from whole-genome 
expression compendia not resolved to the cell lineage of interest. 
These patterns are specific for each cell lineage and generally only 
found in a small subset of experimental conditions, which may 
include genetic, physiological, pathophysiological, environmen- 
tal, or experimental states/perturbation (e.g., biopsy specimens 
from different patients). To discover these cell-lineage-specific ex- 
pression patterns as well as the subsets of conditions that are in- 
formative for a given cell lineage, our approach uses a machine 
learning approach in an iterative probabilistic framework to com- 
bine an expert-provided standard of known cell-lineage-specific 
genes (positives) as well as example genes that are expressed in 
other cell lineages (negatives). However, most solid-tissue cell 
lineages cannot be studied experimentally in high throughput, and 
thus often only a few cell-lineage-specific genes are known with 
high accuracy (e.g., from IHC). The additional challenge here is 
that these standards are often limited in size (especially for cell 
lineages not amenable to experimental microdissection) and can 
be of varying specificity (e.g., specific to cell lineage within the 
immediate structure or whole organ or defined by different 
experimental approaches). 

Because it is experimentally infeasible to obtain pure example 
expression profiles for cell lineages from solid human tissues, our 
method must perform well even while available standards are of- 
ten very limited in size and can be of highly varying specificity. 
This paucity of high-quality standards and the need to effectively 
leverage lower-quality or less specific examples severely limits the 
direct application of traditional machine learning approaches (e.g., 
SVM performance outside of the iterative framework is shown in 
Supplemental Fig. 5). 

To address these challenges, we developed an iterative classi- 
fication approach that continually refines both the predictive cell- 
lineage-specific patterns and informative conditions based on 
statistical scoring and refinement (through informative subset 
selection) of the provided standard. This iterative approach allows 
the user to provide tiered standards; i.e., the investigator identifies 
only the relative specificity of evidence tiers (e.g., low-throughput, 
high-specificity approaches are more reliable than high- 
throughput experimental platforms with lower specificity). The in 
silico nanodissection method is then able to make high-accuracy 
predictions of cell-lineage-specific genes on the whole-genome 
scale and, within the tiered standard constraint, is robust to vari- 
able specificity of example cell-lineage-specific genes. The iterative 
strategy is necessary to allow investigators to add standards of 
questionable quality without dramatically compromising the qual- 
ity of cell-lineage predictions. A linear SVM without this iterative 
approach fails when standards of lower quality are added to high- 
quality standards (Supplemental Fig. 5). 

The researcher defines standards within tiers. Tiers represent 
levels of specificity (e.g., in descending order: double 
immunofluorescence, annotation in a literature-curated database, high-throughput 
protein expression). For each tier, nanodissection calculates the sum 
of the ranks of genes from the classifier (for the case of SVM, this is 
the ranked distance from the SVM hyperplane) for each positive 
example, $R_i$ (here podocyte genes), against each of $M$ negative 
standards, $j$ (e.g., glomerular, mesangial, tubular), as 

$$SR_j = \sum_{i=1}^{n_p} R_i,$$

where $n_p$ represents the number of positives and ranks are 
calculated from only the positive examples and the negative 
examples from standard $j$. It then computes a test statistic for 
this individual separation, $U_j$, for each negative standard as 

$$U_j = \max(C_j,\; n_p n_j - C_j),$$

where $n_j$ is the number of negative examples in standard $j$ and 

$$C_j = n_p n_j + \frac{n_p(n_p + 1)}{2} - SR_j.$$

This is normalized by converting it to a z-score by using the 
mean and standard deviation through 

$$z_j = \frac{U_j - \frac{n_p n_j}{2}}{\sqrt{\frac{n_p n_j (n_p + n_j + 1)}{12}}}.$$

The scores for the individual separations are then combined to 
provide a final score for this tier of standards 

$$p = \prod_{j=1}^{M} \bigl(1 - \mathrm{cdf}(z_j)\bigr).$$


Nanodissection automatically selects the standards resulting in 
the lowest p (which ranges from zero to one), i.e., that which 
corresponds to a better separation of positives from each negative 
standard. 
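
The tier score above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Sleipnir implementation: the function names, the toy classifier scores, and the use of average ranks for ties are all hypothetical choices made for the example, and the normal cdf is computed from the error function.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def separation_z(pos_scores, neg_scores):
    """z-score of U_j for the positives versus one negative standard j."""
    n_p, n_j = len(pos_scores), len(neg_scores)
    ranks = average_ranks(list(pos_scores) + list(neg_scores))
    sr = sum(ranks[:n_p])                        # SR_j: rank sum of positives
    c = n_p * n_j + n_p * (n_p + 1) / 2.0 - sr   # C_j
    u = max(c, n_p * n_j - c)                    # U_j
    mean = n_p * n_j / 2.0
    sd = math.sqrt(n_p * n_j * (n_p + n_j + 1) / 12.0)
    return (u - mean) / sd

def tier_score(pos_scores, negative_standards):
    """Combine the M separations: p = prod_j (1 - cdf(z_j))."""
    p = 1.0
    for neg_scores in negative_standards.values():
        p *= 1.0 - normal_cdf(separation_z(pos_scores, neg_scores))
    return p
```

With this sketch, a tier whose positives all score above every negative standard yields a smaller p than a tier whose positives are interleaved with the negatives, which is the behavior the selection step relies on.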

In certain cases, an additional (and optional) external valida- 
tion gene set may be available. Because nanodissection can be ap- 
plied where experimental microdissection was insufficient, these 
standards may represent both positives and negatives (e.g., in this 
case where additional microarray measurements of the renal 
glomerulus were available as validation). We termed genes in this 
standard "high-throughput-validating" genes and other genes 
"nonvalidating" genes. Nanodissection can use this validation set to 
identify the set of standards providing the best separation of 
validating genes by calculating 

$$SR = \sum_{i=1}^{n_v} R(|d_i|),$$

where $R(|d_i|)$ is the rank of the absolute value of the distance to 
the hyperplane of the validating gene $i$ in a list containing the 
$n_v$ validating genes and the $n_{nv}$ nonvalidating genes. It then 
calculates $U$ as 

$$U = \max(C,\; n_v n_{nv} - C),$$

where 

$$C = n_v n_{nv} + \frac{n_v(n_v + 1)}{2} - SR,$$

which is then converted to a z-score 

$$z = \frac{U - \frac{n_v n_{nv}}{2}}{\sqrt{\frac{n_v n_{nv}(n_v + n_{nv} + 1)}{12}}}.$$

Finally, $p$ for validating versus nonvalidating is calculated as 
$p = 1 - \mathrm{cdf}(z)$. Selecting the standard tier that provided the lowest 
p results in the standard where validating genes were most 
extreme (i.e., best separated from each other). Our results dem- 
onstrate that this approach enables us to use a non-cell-lineage- 
specific validation (i.e., glomerular) gene set to grade our 
separation of putative cell-lineage (podocyte)-specific genes by 
selecting that standard that leads to example genes on the ex- 
tremes (in our example, this has potential podocytes at the top of 
the list and potential nonpodocyte glomerular genes at the bot- 
tom). In the case where there exists a validation standard of high- 
quality specific to our cell lineage of interest, we use $R(d)$ 
directly instead of $R(|d|)$. In that case, this value would represent 
the one-sided Wilcoxon rank-sum p-value for a comparison of 
validating and nonvalidating genes. Because this iterative nano- 
dissection approach relies on genome-scale data obtained from the 
surrounding compartment and because this evaluation was used to 
identify the optimum standards, this p provides a quality measure 
for the resulting standard. Thus nanodissection allows us to ob- 
tain cell-lineage-specific signal from in vivo human data. 
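
As an illustration of the validation score for the non-cell-lineage-specific case, the following Python sketch ranks genes by the absolute value of their distance to the hyperplane and computes p for validating versus nonvalidating genes. The function name and the toy inputs are hypothetical, and ties are broken by input order rather than averaged; this is a sketch under those assumptions, not the implementation used in this study.

```python
import math

def validation_p(distances, is_validating):
    """p for validating vs. nonvalidating genes, ranking on |d|.

    distances: signed distance to the hyperplane for each gene.
    is_validating: True where the gene belongs to the validation set.
    """
    keys = [abs(d) for d in distances]
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    rank = {gene: r + 1 for r, gene in enumerate(order)}  # 1 = smallest |d|
    n_v = sum(1 for v in is_validating if v)
    n_nv = len(distances) - n_v
    sr = sum(rank[i] for i, v in enumerate(is_validating) if v)  # SR
    c = n_v * n_nv + n_v * (n_v + 1) / 2.0 - sr                  # C
    u = max(c, n_v * n_nv - c)                                   # U
    mean = n_v * n_nv / 2.0
    sd = math.sqrt(n_v * n_nv * (n_v + n_nv + 1) / 12.0)
    z = (u - mean) / sd
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 1.0 - cdf
```

When the validating genes sit at both extremes of the ranked list (large positive and large negative distances), their |d| ranks are high and p is small; when they are scattered through the list, p approaches 0.5.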

The nanodissection algorithm therefore proceeds as follows 
(for pseudocode, see Supplemental Fig. 7). Given user-supplied 
standards in tiers of increasing specificity, for each standard level, 
k, combine standards of that level with all standards of higher 
specificity levels. Apply the selected classification algorithm (here 
we applied SVM from the SVMperf package [Joachims 2006] using 
the Sleipnir library [Huttenhower et al. 2008]) and generate a ranked 
list of predictions. Score the predictions for k as described above to 
calculate p for the kth level of specificity. Select the level of speci- 
ficity providing the lowest p. 
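
The outer loop of the procedure can be sketched as follows. This is an illustrative outline only (the pseudocode of record is in Supplemental Fig. 7): `train_and_rank` and `score_predictions` are hypothetical callables standing in for the chosen classifier and for the p calculation described above, and the tiers are assumed to be supplied from most to least specific.

```python
def select_tier(tiers, train_and_rank, score_predictions):
    """Pick the specificity level whose cumulative standard gives the lowest p.

    tiers: list of sets of positive genes, most specific tier first.
    train_and_rank: callable(positives) -> ranked predictions (hypothetical).
    score_predictions: callable(ranking, positives) -> p in [0, 1] (hypothetical).
    Returns (level index, p, ranking) for the best level.
    """
    best = None
    cumulative = set()
    for k, tier in enumerate(tiers):
        # Level k is combined with all standards of higher specificity.
        cumulative |= set(tier)
        positives = frozenset(cumulative)
        ranking = train_and_rank(positives)
        p = score_predictions(ranking, positives)
        if best is None or p < best[1]:
            best = (k, p, ranking)
    return best
```

Keeping the classifier and the scoring function as arguments mirrors the description in the text, where the classification algorithm is selectable and the score is computed separately for each level.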

In this work, standards were obtained from expert literature 
review. The positive podocyte-specific standard genes were re- 
quired to have at least one of the following levels of evidence: 
immunofluorescence staining, in situ hybridization, or electron 
microscopy image of immuno-gold staining of podocytes in vivo. 
Two levels of specificity were evaluated. The most stringent level 
contained genes specifically expressed only in podocytes and no 
other cell types in the human kidney, referred to as podocyte- 
specific in kidney (as an example, see nephrin staining pattern in 
Fig. 3A, I). The less stringent level contained all of the above, as well 
as genes expressed in podocytes and no other cell types in glomeruli 
but also detected in extraglomerular cells of 
the kidney (synaptopodin [SYNPO] and CD2AP staining in Fig. 3A, 
II and III). For the majority of selected genes, evidence for disease 
association in human glomerular failure or murine model systems 
was also available. Application of nanodissection resulted in the 
use of both tiers of standards, which corresponded to a total of 46 
genes that were both podocyte-specific and present in the gene 
expression data set. 
Gene expression data extraction from the GUDMAP 
and data processing 

The genome-wide expression data of murine podocytes, mesangial cells, 
and endothelial cells were obtained from http://www.gudmap.org/. 
The identifiers of the data sets used in our 
study are listed in Supplemental Table 5. The detailed protocol of 
data generation is described in a recently published paper by 
Brunskill et al. (2011) and can also be found in our Supplemental 
Methods. We preprocessed and normalized data as described in 
"Microdissected human kidney biopsy data." By comparing the expression 
level in podocytes versus the other two major cell types in 
glomeruli, we defined a gene to be podocyte-specific if its expression in 
podocytes was 4.76-fold over mesangial cells and 4.65-fold over 
glomerular capillary endothelial cells. The cut-off values represent 
three standard deviations of the average difference between 
podocytes and mesangial/endothelial cell transcripts, respectively. 
By use of HomoloGene from NCBI Entrez (Maglott et al. 2011), 102 
murine genes could be mapped to their Homo sapiens ortholog 
(Supplemental Table 6). 
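
The fold-change filter can be illustrated with a short Python sketch. The two thresholds are those stated above; the function name, the input layout, the assumption that expression values are on a linear (not log) scale, and the choice of "greater than or equal to" at the cut-off are illustrative assumptions rather than details taken from the analysis.

```python
def podocyte_specific(expr, fold_mesangial=4.76, fold_endothelial=4.65):
    """Return genes passing both fold-change cut-offs.

    expr: dict mapping gene -> (podocyte, mesangial, endothelial)
          expression values on a linear scale (assumed).
    """
    selected = []
    for gene, (pod, mes, endo) in expr.items():
        if mes <= 0 or endo <= 0:
            continue  # ratio undefined; skipped in this sketch
        if pod / mes >= fold_mesangial and pod / endo >= fold_endothelial:
            selected.append(gene)
    return selected
```

A gene must pass both comparisons, so a transcript enriched over mesangial cells but not over endothelial cells is not reported.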

Evaluation of intrarenal protein localization in HPA 

We evaluated the intrarenal localization pattern of protein products 
of predicted genes based on HPA 7.0-2010.11.15. Intraglomerular 
cell types were identified based on their localization 
pattern inside the glomerular tuft by three investigators with 
expertise in renal histopathology, independently and in a blinded 
manner. Conflicts were resolved by a majority vote. The following 
staining patterns were considered inconclusive and excluded from 
the analysis: (1) proteins with negative staining or "data not 
available"; (2) proteins with only a single renal histology image 
available; and (3) proteins with a diffuse nonspecific staining 
pattern. If several antibodies were evaluated for a specific protein, 
the images from the antibody with the highest degree of specificity 
were used for evaluation. Tubular brush border staining was con- 
sidered unspecific. In general, protein localization patterns were 
classified into three groups: (1) expressed exclusively in podocytes 
with no other cell types exhibiting staining in kidney section, re- 
ferred to as "podocyte-specific in kidney" (i.e., nephrin [NPHS1] 
in Fig. 3A, I); (2) expressed specifically in podocytes within the 
glomerulus but with positive staining also observed in tubular- 
interstitial compartments, referred to as "podocyte-specific in 
glomerulus" (i.e., SYNPO and CD2-associated protein [CD2AP] in 
Fig. 3A, II and III, respectively); and (3) all remaining staining 
patterns as "other renal cell." For clarity, we refer to categories 1 and 2 
in aggregate as "podocyte-specific" genes. 

IHC staining of kidney biopsy tissues 

Following a previously described protocol (Lorz et al. 2008), IHC 
studies were performed using a PCOLCE2 primary rabbit antibody 
(HPA013203, Sigma-Aldrich). 

Association of RNA expression of cell-lineage-specific genes 
to kidney function 

The expression of the top 136 candidate genes in a cohort of 139 
patients with CKD was correlated to the square root of the GFR, 
calculated by the MDRD (Modification of Diet in Renal Disease) 
equation (Levey et al. 2006), using Pearson correlation. The 
correlations were compared against a randomized set by analyzing their 
correlation density plots using the sm R package (Bowman and 
Azzalini 2010). Randomization was performed by randomly 
reassigning expression values to GFR 100 times on the given data set, 
followed by recalculation of the correlation. 
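
The correlation and randomization steps can be sketched in Python as follows. This is a minimal illustration, not the R code used in this study: the function names and the fixed random seed are hypothetical, reassignment is implemented by shuffling the GFR values across patients, and the density comparison performed with the sm package is not reproduced.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def gfr_correlations(expr_by_gene, gfr, n_perm=100, seed=0):
    """Observed and randomized correlations of expression with sqrt(GFR).

    expr_by_gene: dict gene -> list of expression values, one per patient.
    gfr: list of GFR values in the same patient order.
    """
    root_gfr = [math.sqrt(g) for g in gfr]
    observed = {g: pearson(v, root_gfr) for g, v in expr_by_gene.items()}
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = root_gfr[:]
        rng.shuffle(shuffled)  # break the patient-to-GFR pairing
        null.extend(pearson(v, shuffled) for v in expr_by_gene.values())
    return observed, null
```

The two returned collections correspond to the red (observed) and black (randomized) densities described for Figure 4C.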



Manual tissue-of-origin sample annotation for the web server 

Microarray experiments (Affymetrix GeneChip Human Genome 
U133 Plus 2.0) were manually annotated to the sample's tissue of 
origin using the controlled vocabulary in the Brenda Tissue On- 
tology. To ensure wide coverage of tissue types, a broad set of 
candidate samples for each tissue was identified with an initial 
term matching, corrected for linguistic variations with stemming, 
for each Brenda term (and its synonyms) on sample descriptions 
available in GEO. These term-to-experiment matches were then 
manually curated, verified, or corrected based on the correspond- 
ing sample descriptions. Only matches for terms across at least two 
independent data sets were reviewed. Only the tissue-of-origin 
information was considered in the manual evaluation, and so, 
tumor-adjacent normal breast biopsy samples were correctly an- 
notated to "breast," for example. We excluded tissue mixture 
samples, reference samples, and nonhuman samples. We also ex- 
cluded samples with ambiguous descriptions, as well as cell line 
and cancer terms. Samples annotated to detailed terms in the 
controlled vocabulary were propagated up to organ-level annotations, 
based on the organ of origin. Terms with fewer than 10 
annotated samples after propagation were excluded at this stage. This 
procedure resulted in a manually annotated compendium of 7539 
samples from 28 tissues that we make available through the 
nanodissection web server for nanodissection analysis. 
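
The propagation and filtering steps can be illustrated with a small Python sketch. It assumes a simplified ontology in which each detailed term has a single parent and the top-most ancestor is the organ-level term; the function name, the `parent_of` mapping, and the example terms are hypothetical and do not reproduce the Brenda Tissue Ontology structure.

```python
from collections import defaultdict

def propagate_and_filter(sample_terms, parent_of, min_samples=10):
    """Map samples to organ-level terms and drop sparsely populated terms.

    sample_terms: dict sample id -> annotated tissue term.
    parent_of: dict term -> parent term; organ-level terms have no entry.
    """
    by_organ = defaultdict(set)
    for sample, term in sample_terms.items():
        visited = set()
        # Walk up the hierarchy until an organ-level term is reached.
        while term in parent_of and term not in visited:
            visited.add(term)
            term = parent_of[term]
        by_organ[term].add(sample)
    # Keep only terms with at least min_samples annotated samples.
    return {organ: s for organ, s in by_organ.items() if len(s) >= min_samples}
```

For example, samples annotated to a detailed kidney term would be counted under "kidney", and an organ with only a handful of samples after propagation would be excluded.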

Data access 

Normalized gene expression data files of microdissected human 
kidney biopsies have been submitted to the NCBI Gene Expression 
Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/; Edgar et al. 
2002) under accession number GSE47185. 